Main packages used:
haven, readxl, readr
Main functions covered: readr::read_csv, readxl::read_excel, haven::read_stata, haven::read_spss, haven::read_sas, write.table()
Supplementary resources:
We saw how we can create data within R, but most of the time you need to load an existing dataset into R to start the analysis.
At this point we want more than what base R can offer to us. Let’s install and load some packages! Packages are the cornerstone of the R ecosystem: there are thousands of super useful packages (the most common repository for them is CRAN). Whenever you face a specific problem (that can be highly domain specific) there is a good chance that there is at least one package that offers a solution.
An R package is a collection of functions that work much the same way as the ones we saw earlier. These functions and packages are written by R users and shared with the community. The focus and range of these packages are wide: from data cleaning to data visualization to ecological and environmental data analysis, there is a package for everyone and everything. This ample supply might be intimidating at first, but it also means that there is probably a solution out there to a given problem.
To install a package from the CRAN repository we will use the install.packages() function. Note that it requires the package’s name as a character string.
# data import / export
install.packages("readr")
install.packages("haven")
install.packages("readxl")
# exploring data
install.packages("gapminder")
install.packages("dplyr")
install.packages("ggplot2")
After you have installed a package, you need to load it to be able to use its functions. We do this with the library() command. It is good practice to load all the packages at the beginning of your script.
# data import / export
library(readr) # importing various data formats into R
library(haven) # importing data formats from other programs (Stata, SAS, SPSS)
#> Warning: package 'haven' was built under R version 3.4.4
library(readxl) # importing Excel tables into R
#> Warning: package 'readxl' was built under R version 3.4.4
# exploring data
library(gapminder) # data we will use
library(dplyr) # data manipulation package
#> Warning: package 'dplyr' was built under R version 3.4.4
library(ggplot2) # data visualization package
#> Warning: package 'ggplot2' was built under R version 3.4.4
Important note: whenever there is a conflicting function name (e.g. two packages export a function with the same name) you can specify which function you want to use with the package::function syntax. Below, when loading in the data, I use readr::read_csv to signify which package the function comes from.
We will look at the Quality of Government basic data set and import it with different file extensions. First let’s load the .csv file (.csv stands for comma separated values). You can either load it from your project folder or directly from the GitHub repo. We are using the readr package, which reads comma separated values with the read_csv function. It is a special case of the read_delim function, where you can specify the character that is used as a delimiter (e.g. in much of Europe the comma is used as the decimal mark, so the delimiter is often a semicolon).
(the codebook for the dataset is here, if you are interested: https://www.qogdata.pol.gu.se/data/qog_bas_jan18.pdf)
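To illustrate the point about delimiters, here is a minimal sketch of reading a semicolon-delimited file with decimal commas. The temporary file and its two rows are made up for this illustration and are not part of the Quality of Government data.

```r
library(readr)

# Write a tiny semicolon-delimited file with decimal commas to a temporary path
# (hypothetical content, only for this illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country;score", "Hungary;1,5", "Sweden;2,25"), tmp)

# read_delim() with an explicit delimiter and a locale that treats "," as the decimal mark
scores <- read_delim(tmp, delim = ";",
                     col_types = "cd",  # c = character column, d = double column
                     locale = locale(decimal_mark = ","))

# read_csv2() is readr's shortcut for the same semicolon / decimal-comma convention
scores2 <- read_csv2(tmp, col_types = "cd")

scores
```

Both calls should give the same two-row table, with score stored as a number rather than as text.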
Loading it from the GitHub repository requires only the url of the data file and then the read_csv function will take care of the rest.
qog_text <- readr::read_csv("https://rawgit.com/aakosm/r_basics_ecpr18/master/qog_bas_cs_jan18.csv")
If you have the data downloaded you can load it from your project folder. In the code below, I put the data file into the data folder within my project folder (the full path looks like this: mydrive:/folder/project_folder/data). The "data/file.csv" part is called the relative path: when using a project we do not need to type out the whole path to the file, just its location relative to our main project folder.
qog_text <- read_delim("data/qog_bas_cs_jan18.csv", delim = ",")
With readr::read_csv I specified that I use the function from that specific package. The package::function syntax is useful if there are conflicting functions in the loaded packages, or if you want to make the package explicit when functions have very similar names. In this case, base R also has a read.csv function, which is a bit slower than the one in readr.
Next we load the Excel file. We use the readxl package’s read_excel function (unfortunately it does not support opening files via URLs).
qog_excel <- read_excel("data/qog_bas_cs_jan18.xlsx")
Importing data files from Stata 13+, SPSS, and SAS is similarly easy using the haven package. If you have a data file that is not in these formats (or collaborators who work with weird software choices) you can check the foreign and rio packages. Base R also has some capability, but it is quite picky about software versions (check the read.* functions).
# read the Stata .dta file
qog_stata <- read_stata("data/qog_bas_cs_jan18.dta")
# read the SPSS .sav file
qog_spss <- read_spss("data/qog_bas_cs_jan18.sav")
# read the SAS .SAS7BDAT file
beer_sas <- read_sas("data/beer.sas7bdat")
To remove the unnecessary objects, we can use the rm() function.
rm(beer_sas, qog_excel, qog_stata, qog_spss)
To remove everything from your environment, you can use rm(list=ls()). Note: this will only delete the objects in your environment; the packages stay loaded!
There are loads of ways to load data to R via APIs, from various sources. This subsection builds on the excellent blogpost of Data Acquisition in R. We are not going into details (if you are interested, give the linked post a read!).
The eurostat package lets you import data from… Eurostat!
The wbstats package gives access to the World Bank data API.
The OECD package provides access to the API of the OECD database.
The WID package is for the World Wealth and Income Database. It is a bit trickier to install than with our trusted install.packages(). Since it is not on the CRAN repository, we need to download the package directly from its developers’ GitHub page, which contains the source code. The devtools package allows just that.
install.packages("devtools")
devtools::install_github("WIDworld/wid-r-tool")
library(wid)
QUICK EXERCISE: read the help of the read_excel function and import the data into a new object where only the first 5 columns are imported.
Solution:
#> # A tibble: 194 x 5
#> ccode cname ccodealp ccodecow ccodewb
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 4.00 Afghanistan AFG 700 4.00
#> 2 8.00 Albania ALB 339 8.00
#> 3 12.0 Algeria DZA 615 12.0
#> 4 20.0 Andorra AND 232 20.0
#> 5 24.0 Angola AGO 540 24.0
#> 6 28.0 Antigua and Barbuda ATG 58.0 28.0
#> 7 31.0 Azerbaijan AZE 373 31.0
#> 8 32.0 Argentina ARG 160 32.0
#> 9 36.0 Australia AUS 900 36.0
#> 10 40.0 Austria AUT 305 40.0
#> # ... with 184 more rows
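One possible way to approach this exercise is the range argument of read_excel. The sketch below is not necessarily the solution the author had in mind. It runs on an example workbook that ships with readxl, so it works without the course files; the commented-out line shows the analogous call for the QoG file, assuming it sits in the data folder as above.

```r
library(readxl)

# readxl ships with a few example workbooks; readxl_example() returns the path to one
example_path <- readxl_example("datasets.xlsx")

# cell_cols() restricts the import to a range of spreadsheet columns (here A to C)
first_cols <- read_excel(example_path, range = cell_cols("A:C"))
first_cols

# For the exercise, the analogous call would be (assuming the file location used earlier):
# qog_first5 <- read_excel("data/qog_bas_cs_jan18.xlsx", range = cell_cols("A:E"))
```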
Let’s export our Quality of Government data out of R. We can use write.table() to create a .csv file in our project directory.
write.table(qog_text, "qog_export.csv", sep = ",", row.names = FALSE) # row.names = FALSE keeps the header aligned with the data columns
Or we can use R’s own data format, .Rda, which is more economical in hard disk space usage. However, if you intend to share your exported data, I’d recommend the .csv output of write.table.
The peculiarity of save() is that it can save any R object, not just a data frame, so if you want to reuse a given object later and do not want to spend computational time recreating it every time, you can save it as well. It will also save the file into your working directory.
save(qog_text, file = "qog_rda.Rda")
Main packages used:
base, ggplot2, dplyr
Main functions covered: head(), tail(), dplyr::glimpse(), str(), summary(), table(), dplyr::group_by() and summarise(), ggplot2::ggplot()
Supplementary resources: ggplot2 cheat sheet
We will use the gapminder package that contains macro data from the famous Hans Rosling presentation on life expectancy and regional convergence.
gapminder_df <- gapminder
These data sets are small enough that you can check them in RStudio’s data viewer (or with the View() function). However, if you have a bigger dataset or less memory, this can be problematic. The most basic and quickest way to check whether your data loaded properly is the head() function. It shows you the first few rows of the data. tail() shows you the end of the data set.
head(gapminder_df)
#> # A tibble: 6 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779
#> 2 Afghanistan Asia 1957 30.3 9240934 821
#> 3 Afghanistan Asia 1962 32.0 10267083 853
#> 4 Afghanistan Asia 1967 34.0 11537966 836
#> 5 Afghanistan Asia 1972 36.1 13079460 740
#> 6 Afghanistan Asia 1977 38.4 14880372 786
# let's check the end of our data
tail(gapminder_df)
#> # A tibble: 6 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Zimbabwe Africa 1982 60.4 7636524 789
#> 2 Zimbabwe Africa 1987 62.4 9216418 706
#> 3 Zimbabwe Africa 1992 60.4 10704340 693
#> 4 Zimbabwe Africa 1997 46.8 11404948 792
#> 5 Zimbabwe Africa 2002 40.0 11926563 672
#> 6 Zimbabwe Africa 2007 43.5 12311143 470
If you are interested in only a certain number of observations and columns, you can specify that as well with the head function or by indexing your data frame.
head(gapminder_df, 3)
#> # A tibble: 3 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779
#> 2 Afghanistan Asia 1957 30.3 9240934 821
#> 3 Afghanistan Asia 1962 32.0 10267083 853
Or check rows 1 to 3 of the first two variables.
head(gapminder_df[1:3, 1:2])
#> # A tibble: 3 x 2
#> country continent
#> <fct> <fct>
#> 1 Afghanistan Asia
#> 2 Afghanistan Asia
#> 3 Afghanistan Asia
If you want to have a quick overview of your data, both str() and dplyr::glimpse() are helpful.
str(gapminder_df)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
#> $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
#> $ lifeExp : num 28.8 30.3 32 34 36.1 ...
#> $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
#> $ gdpPercap: num 779 821 853 836 740 ...
Both the str and glimpse functions print out parts of your data set; the main difference between them and head is that they give you a complete look at your list of variables and their types.
dplyr::glimpse(gapminder_df)
#> Observations: 1,704
#> Variables: 6
#> $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
#> $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
#> $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
#> $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
You can use str() on any object to see its structure (we’ll come back to this later).
Now that we know we imported our data properly and have some sense of its dimensions (check what the dim() function does!) let’s have a more in depth look!
For starters, use summary() to check some basic descriptives of our variables.
summary(gapminder_df)
#> country continent year lifeExp
#> Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
#> Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
#> Algeria : 12 Asia :396 Median :1980 Median :60.71
#> Angola : 12 Europe :360 Mean :1980 Mean :59.47
#> Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
#> Australia : 12 Max. :2007 Max. :82.60
#> (Other) :1632
#> pop gdpPercap
#> Min. :6.001e+04 Min. : 241.2
#> 1st Qu.:2.794e+06 1st Qu.: 1202.1
#> Median :7.024e+06 Median : 3531.8
#> Mean :2.960e+07 Mean : 7215.3
#> 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
#> Max. :1.319e+09 Max. :113523.1
#>
For the categorical variables you can use table(), which generates a frequency table.
frq_table <- table(gapminder_df$continent)
frq_table
#>
#> Africa Americas Asia Europe Oceania
#> 624 300 396 360 24
Now we dip into the tidyverse with the dplyr package.
Disclaimer: I will emphasise the tidyverse throughout the course, since it is one of the most useful meta-packages: a collection of packages that helps to develop a consistent workflow from data import to data cleaning to analysis. We will spend more time on this during the coming sessions. The tidyverse packages are a great stepping stone into R, and later on, for more complex problems, you can turn to tools outside of these packages.
https://stackoverflow.blog/2017/10/10/impressive-growth-r/
One key advantage of the tidyverse packages is their compatibility with the pipe operator: %>% (shortcut: Ctrl + Shift + M). It combines code and makes it more intuitive to work with data. Think of the %>% as a “then”. This is just a short intro, so don’t worry if this is confusing at first. We will be using dplyr and other tidyverse packages a lot so you’ll have enough time to get comfortable. For every “tidy” function and solution there is an alternative in base R or in another package, so if you find something unintuitive or problematic, you can find another way by searching for it.
The %>% essentially pipes the data on its left as the first argument of the function after it.
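A minimal sketch of what that means in practice: the nested call and the piped call below should give the same result. The toy_df data frame is made up for this illustration.

```r
library(dplyr)  # loading dplyr also makes the %>% operator available

# A small made-up data frame, used only for this illustration
toy_df <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 10))

# Without the pipe: the data frame is passed as the first argument, inside out
nested <- summarise(group_by(toy_df, group), mean_value = mean(value))

# With the pipe: read it as "take toy_df, then group it, then summarise it"
piped <- toy_df %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

piped
```

The piped version reads top to bottom in the order the steps happen, which is the main reason it is used throughout the examples here.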
In our first step, we group our data by continent and then create a summary variable, that tells us the mean GDP for each continent. Check the code snippet below to see the role of each line.
# let's see the GDP by continents
gdp_cont <- gapminder_df %>% # we will use the gapminder data
dplyr::group_by(continent) %>% # then group the data by continent
dplyr::summarise(mean_gdp = mean(gdpPercap)) # then create a `mean_gdp` variable where we compute the mean of the grouped gdpPercap variable.
# let's see the result.
gdp_cont
#> # A tibble: 5 x 2
#> continent mean_gdp
#> <fct> <dbl>
#> 1 Africa 2194
#> 2 Americas 7136
#> 3 Asia 7902
#> 4 Europe 14469
#> 5 Oceania 18622
You can also create your own summary.
# the number of observations and mean for the GDP per capita variable
gapminder_df %>%
summarise(n = n(), gdp_mean = mean(gdpPercap))
#> # A tibble: 1 x 2
#> n gdp_mean
#> <int> <dbl>
#> 1 1704 7215
QUICK EXERCISE: combine the two approaches and check the average, minimum, and maximum life expectancy by continent (hint: use the mean(), max() and min() functions). Your result should look like the one below.
solution:
#> # A tibble: 5 x 5
#> continent n mean maximum minimum
#> <fct> <int> <dbl> <dbl> <dbl>
#> 1 Africa 624 48.9 76.4 23.6
#> 2 Americas 300 64.7 80.7 37.6
#> 3 Asia 396 60.1 82.6 28.8
#> 4 Europe 360 71.9 81.8 43.6
#> 5 Oceania 24 74.3 81.2 69.1
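One way to arrive at a table of this shape is sketched below. It is a possible solution rather than necessarily the author's own; the object name life_exp_cont is made up, and the column names are chosen to match the output above.

```r
library(dplyr)
library(gapminder)

life_exp_cont <- gapminder %>%
  group_by(continent) %>%              # one group per continent
  summarise(n = n(),                   # number of country-years in the group
            mean = mean(lifeExp),      # average life expectancy
            maximum = max(lifeExp),
            minimum = min(lifeExp))

life_exp_cont
```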
Using data visualization is a great way to get acquainted with your data, and sometimes it makes more sense than looking at large tables. In this section we get into the ggplot2 package, which we’ll use throughout the class. It is the cutting edge of R’s data visualization toolset (not just in academia, but in business and data journalism as well).
We will spend most of our time using ggplot2 for visualizing in the class and I would personally encourage the course participants to stick to ggplot2. If for some reason you would like a non-ggplot way of plotting in R, there is a section on base R plotting at the end of this notebook.
The name stands for grammar of graphics. It enables you to build your plot layer by layer while keeping the ability to control every detail of the output (if you so wish). It is used by many in academia, by Uber, StackOverflow, AirBnB, the Financial Times, BBC and FiveThirtyEight writers, among many others.
You create plots with the below syntax:
Source: Healy, Kieran. Data Visualization: A Practical Introduction. Princeton University Press, 2018. (Ch. 3)
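In general terms, the syntax follows a template along the lines of the sketch below. It uses R's built-in mtcars data, which is not part of this tutorial's data sets, and the object name p_template is made up.

```r
library(ggplot2)

# General pattern: ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
# Illustrated with the built-in mtcars data (an assumption made for this sketch)
p_template <- ggplot(data = mtcars,                    # 1. the data frame to plot
                     mapping = aes(x = wt, y = mpg)) + # 2. which variables go on which axis
  geom_point()                                         # 3. the geom decides how the data is drawn

p_template
```

Further layers (scales, labels, facets) are added with more + signs, as the examples that follow show.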
To get some idea about our variables, let’s plot them on a histogram. First, we examine the GDP per capita variable from our gapminder dataset. To do this, we use the geom_histogram() function of ggplot2. It gives a bare-bones histogram (the frequency distribution) of the chosen continuous variable.
Let’s create the foundation of our plot by specifying for ggplot the data we use and the variable we want to plot.
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap))
We need to specify what sort of shape we want our data to be displayed in. We can do this by adding the geom_histogram() function with a +.
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap)) +
geom_histogram()
Looks a little bit skewed. Let’s log transform our variable with the scale_x_log10() function.
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap)) +
geom_histogram() +
scale_x_log10()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As the message says, we can mess around with the binwidth argument, so let’s do that.
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap)) +
geom_histogram(binwidth = 0.05) +
scale_x_log10()
Of course if one prefers a boxplot, that is possible as well. We will check how life expectancy varies between and within continents. We’ll use geom_boxplot(). In this approach, we create an object for our plot. You don’t need to do this (though there are instances where it is useful), but it shows you that just about everything can be an object in R.
p_box <- ggplot(data = gapminder_df,
mapping = aes(x = continent,
y = lifeExp)) +
geom_boxplot()
p_box
The box plot is interpreted as follows. The box contains the middle 50% of the values, and the line inside the box is the median. The lower and upper edges of the box are the first and third quartiles, respectively. The whiskers extend to the smallest and largest values that are not classified as outliers (by default, values within 1.5 times the interquartile range of the box edges); outliers are drawn as individual points.
In visual form:
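The quantities a box plot displays can also be computed by hand. Here is a minimal sketch in base R on a made-up vector, assuming the common 1.5 × IQR rule for the whiskers (which is also the geom_boxplot() default):

```r
# Made-up values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))  # lower box edge, median, upper box edge
iqr <- q[["75%"]] - q[["25%"]]                # interquartile range = height of the box

# Values further than 1.5 * IQR from the box edges count as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
is_outlier <- x < lower_fence | x > upper_fence

whiskers <- range(x[!is_outlier])  # whiskers end at the most extreme non-outlying values
outliers <- x[is_outlier]          # these would be drawn as individual points
```

In this toy vector the value 50 lies beyond the upper fence, so the upper whisker stops at 9 and 50 is treated as an outlier.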
To use a bar plot instead, we simply switch to geom_bar().
ggplot(data = gapminder_df,
mapping = aes(x = continent,
y = lifeExp)) +
geom_bar()
#> Error: stat_count() must not be used with a y aesthetic.
Oops. ggplot’s geom_bar() wants to carry out a counting exercise, which it cannot do if we have a y variable specified. We can solve this in two ways. First, let’s tell ggplot that it should not do any additional calculations, by specifying stat = "identity".
# using stat = "identity"
ggplot(data = gapminder_df,
mapping = aes(x = continent, y = lifeExp)) +
geom_bar(stat = "identity")
Second, we can use the geom_col geom, which already assumes the “identity” argument.
ggplot(data = gapminder_df,
mapping = aes(x = continent,
y = lifeExp)) +
geom_col()
Let’s use the gapminder dataset we have loaded and investigate the life expectancy and GDP per capita variables. We’ll use the geom_point() function, which we join to the p1 object with a +.
In p1 we specify the data source and the variables we want to plot. To this object we can attach the desired representation style with the geom_ functions.
p1 <- ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap,
y = lifeExp))
p1 + geom_point()
Let’s refine this plot slightly: add labels, a title, a caption, and also transform the GDP variable (plus some other minor cosmetics).
Check the comments in the code snippet to see what each line does!
p1 + geom_point(alpha = 0.25) + # inside the geom_ we can modify its attributes. Here we set the transparency levels of the points
scale_x_log10() + # rescale our x axis
labs(x = "GDP per capita",
y = "Life expectancy",
title = "Connection between GDP and Life expectancy",
subtitle = "Points are country-years",
caption = "Source: Gapminder")
So far so good. With some minor additions the plot looks all right. But what if we want to see how each continent fares in this relationship? We need to change the p1 object to include a new argument in the aes() mapping: color = variable. Now it is clear that European countries (country-years) are clustered in the high-GDP/high life expectancy upper right corner.
p1_grouped <- ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent)) # this is where we specify that we want to color the data by continents.
p1_grouped + geom_point(alpha = 0.75) + # inside the geom_ we can modify its attributes. Here we set the transparency levels of the points
scale_x_log10() + # rescale our x axis
labs(x = "GDP per capita (log $)",
y = "Life expectancy",
title = "Connection between GDP and Life expectancy",
subtitle = "Points are country-years",
caption = "Source: Gapminder dataset")
When we are done with our nice figure, we can save it as well. I’d suggest always saving with code, and never from the “plots” pane on the right.
ggsave("gapminder_scatter.png", dpi = 600) # the higher the dpi, the smoother your plot will look.
We can see how life expectancy changed in Mexico, Afghanistan, Sudan and Slovenia by using the geom_line() geom. For this, we create a new dataset by subsetting the gapminder one. The %in% operator does the same thing as == but for multiple values. For subsetting we use the dplyr::filter() function. Don’t worry if this sounds like too much, we will spend a whole session on how to subset and clean our data.
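As a small sketch of how %in% behaves, here it is on a made-up character vector (the names countries and wanted are hypothetical): it returns TRUE for every element that matches any of the values on its right.

```r
# Made-up vector of country names, only for this illustration
countries <- c("Mexico", "Sudan", "Mexico", "Slovenia", "Chile")
wanted <- c("Mexico", "Slovenia")

# Element-wise membership test: is each element of `countries` found in `wanted`?
is_wanted <- countries %in% wanted

# The logical vector can then be used to subset
selected <- countries[is_wanted]
selected
```

dplyr::filter() uses the same kind of logical vector to decide which rows to keep.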
#subset the dataset to have our selected countries.
comp_df <- gapminder_df %>%
filter(country %in% c("Mexico", "Afghanistan", "Sudan", "Slovenia"))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
# create the ggplot object with the data and mapping info
ggplot(data = comp_df,
mapping = aes(x = year,
y = lifeExp,
color = country)) +
geom_line(aes(group = country)) # we need to tell ggplot that we want to group our lines by countries
ggplot2 makes it easy to create individual subplots for each category by “faceting” our data. Let’s plot the growth in life expectancy over time on each continent. We use the geom_line() function to draw a line and we tell ggplot to facet by adding the facet_wrap(~ variable) function.
ggplot(data = gapminder_df,
mapping = aes(x = year,
y = lifeExp)) +
geom_line(aes(group = country)) + # we need to tell ggplot that we want to group our lines by countries
facet_wrap(~ continent) # create a small graph for each continent
To get some idea about our variables, let’s plot them on a histogram. First, we examine the GDP per capita variable from our gapminder dataset. To do this, we just use the hist() function. It gives a bare-bones histogram (the frequency distribution of our chosen continuous variable).
# histogram indicates that our data is not normally distributed.
hist(gapminder_df$gdpPercap)
# maybe transforming it would help
hist(log(gapminder_df$gdpPercap))
If the plot is shared with some collaborators, maybe some formatting could help. We can change the ‘breaks’ argument to have a more granular view of our data. To turn off scientific notation, let’s use options(scipen=5).
hist(gapminder_df$gdpPercap, breaks = 30)
options(scipen=5)
# you can also control the breakpoints beyond giving it a simple value.
hist(gapminder_df$gdpPercap, breaks = seq(0, 120000, 1500))
For a categorical variable, we should use a bar plot. To see how many observations we have for each continent, we can use the table() function to count the observations in each category and then plot that object with barplot().
cont_table <- table(gapminder_df$continent)
barplot(cont_table, main = "Number of observations by continents")
For continuous data, we can use the generic plot() function.
# to check how life expectancy changed in Mexico, we can subset our dataset and then plot the result. More on subsetting tomorrow!
mexico <- subset(gapminder_df, country == "Mexico")
# the first argument of the plot is the x axis, second is the y axis.
plot(mexico$year, mexico$lifeExp)
Something is not quite right…
# check how we can change the dots to a line.
# ?plot
# We should add `type`. Also we can structure our function to be a bit more readable.
plot(mexico$year, mexico$lifeExp,
xlab = "Year",
ylab = "Life expectancy (Years)",
col = "orange",
type = "l")
plot(log(gapminder_df$gdpPercap),
gapminder_df$lifeExp,
xlab = "Log GDP per capita",
ylab = "Life expectancy",
main = "Relationship between GDP and life expectancy",
type = "p")
To see how each variable is associated with the others, we can use the pairs() function. It creates a scatter plot matrix with the variable names on the diagonal. Each variable is on the vertical axis in the plots of its row and on the horizontal axis in the plots of its column.
pairs(gapminder_df[,c("lifeExp", "pop", "gdpPercap")])
We can make the plots a little nicer by adjusting the points (this can be done for the previous scatterplots as well!). pch = 19 selects a point type which is filled, then with col = rgb() we can specify the exact color that we are looking for. The alpha argument regulates the transparency of the points.
source: http://kktg.net/sgr/wp-content/uploads/2014/02/fig-15-3-pch-values.png
pairs(gapminder_df[,c("lifeExp", "pop", "gdpPercap")],
pch=19,
col=rgb(0,0,0, alpha=0.1))
The formula for the boxplot instructs R to create a boxplot of the numerical variable lifeExp for each continent. In general terms, the formula is y ~ group.
boxplot(lifeExp ~ continent, data = gapminder_df,
main = "Distribution of life expectancy by continent",
xlab = "Continents",
ylab = "Life expectancy")
We can also do a grouped bar plot. For this we will use the starwars dataset (shipped with dplyr) and do a grouped bar plot of the eye color and gender in Star Wars movies. The logic is the same as for the barplot with a single variable, but now we pass two variables to the table() function. We also add a legend to our plot with the legend() function.
# grouped bar plot
# load the data
sw <- starwars
# create the table with the two variables of interest
sw_table <- table(sw$gender, sw$eye_color)
sw_table
#>
#> black blue blue-gray brown dark gold green, yellow hazel
#> female 2 6 0 5 0 0 0 2
#> hermaphrodite 0 0 0 0 0 0 0 0
#> male 7 13 1 16 1 1 1 1
#> none 1 0 0 0 0 0 0 0
#>
#> orange pink red red, blue unknown white yellow
#> female 0 0 0 1 1 1 1
#> hermaphrodite 1 0 0 0 0 0 0
#> male 7 1 2 0 2 0 9
#> none 0 0 1 0 0 0 0
# plot the grouped barplot. Mind the `beside` argument! We can also add a legend, so the colors are straightforward.
barplot(sw_table,
beside = TRUE,
col = c("orange", "brown", "green", "black"))
legend(x = "topright",
legend = c("Female", "Hermaphrodite", "Male", "None"),
fill = c("orange", "brown", "green", "black"),
cex = 0.7)
If you want to have the proportional numbers, we can use the prop.table() function. If we use the sw_table table as the only argument, then we will get proportions of the total. If we use 1 as the second argument, we get proportions of rows and if we use 2, we get proportions of columns. Remember, rows come first and columns second!
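A minimal sketch of the three variants on a small made-up two-way table (toy_table is hypothetical and much smaller than sw_table, so the numbers are easy to check by hand):

```r
# A small made-up two-way table (rows: gender, columns: eye color)
toy_table <- as.table(matrix(c(2, 6,
                               7, 13),
                             nrow = 2, byrow = TRUE,
                             dimnames = list(gender = c("female", "male"),
                                             eye_color = c("black", "blue"))))

prop_total <- prop.table(toy_table)     # each cell divided by the grand total
prop_rows  <- prop.table(toy_table, 1)  # margin 1: each row sums to 1
prop_cols  <- prop.table(toy_table, 2)  # margin 2: each column sums to 1

prop_rows
```

For instance, the female/black cell of prop_rows is 2 / (2 + 6) = 0.25, the share of black eyes among the female characters in this toy table.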